Tools and Techniques
This workshop provides an overview of topics and practical examples of using Machine Learning tools on HPC resources. ⭐
🔑 Key Topics:
For suggestions: cpeterson@oarc.ucla.edu
This presentation and accompanying materials are available on 🔗 UCLA OARC GitHub Repository
You can view the slides in:
Each file provides detailed instructions and examples on the various topics covered in this workshop.
Note: 🛠️ This presentation was built using Quarto and RStudio.
🚀 HPC uses MANY computers to solve large problems faster than a normal computer.
🕒 If your task takes a long time to run on a laptop or a lab’s server, HPC can ‘speed up’ your application.
📚 You can store LARGE amounts of data, too big for your laptop.
🤝 HPC provides an excellent platform for collaborations and faster results.
Beowulf-style cluster: multiple computers combined to act as a single computing resource.
🤖 Most Machine Learning packages can utilize multiple CPUs and GPUs to run your models in parallel!
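Most of these libraries parallelize by farming independent pieces of work out to worker processes. The idea can be sketched with Python's standard library alone (a hypothetical toy workload, not code from the workshop materials):

```python
from multiprocessing import Pool

def evaluate_model(params):
    """Hypothetical stand-in for training one model configuration."""
    return params["depth"] * 0.1  # pretend this is a validation score

if __name__ == "__main__":
    # Ten independent configurations, spread across 4 worker processes.
    configs = [{"depth": d} for d in range(1, 11)]
    with Pool(processes=4) as pool:
        scores = pool.map(evaluate_model, configs)
    print(max(scores))  # best score among the ten runs
```

ML libraries expose the same pattern through a single knob, e.g. scikit-learn's `n_jobs` parameter.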
Hoffman2 supports 🐍 Python applications, and it is HIGHLY recommended to use Python versions built and tested by Hoffman2 staff.
🚫 Avoid using system python builds (e.g., /usr/bin/python). Instead, use module load commands to access optimized versions.
`/u/local/apps/python/3.7.3/gcc-4.8.5/bin/python3`
Basic Builds on Hoffman2:
- The Python builds on Hoffman2 include only the basic compiler/interpreter and a few essential packages.
You can install additional packages in `$HOME`, `$SCRATCH`, or any project directories.
Installation Methods:
Example: installing the `scikit-learn` package via the pip (PyPI) package manager:
Understanding the `--user` Flag:
The `--user` flag ensures the package installs in your `$HOME` directory. With `--user`, packages install in `$HOME/.local`, avoiding permission errors.
Finding Available Versions of R:
Loading a Specific Version of R: Example to load R version 4.2.2 with GCC version 10.2.0:
Ensuring Correct Module Loads:
- 🔧 Load the `gcc` or `intel` modules first, as indicated by `modules_lookup`. This step ensures that the correct versions of the gcc and intel libraries are loaded for R.
Standard Installation Command:
🏠 R will suggest a new path in your $HOME directory, determined by $R_LIBS_USER.
Each R module on Hoffman2 has a unique $R_LIBS_USER to prevent conflicts between different R versions.
Anaconda is a popular Python and R distribution, ideal for simplifying package management and pipelines.
Hoffman2 has Anaconda installed, allowing users to create their own conda environments.
Warning
🚫 No Need for Other Python/R Modules:
Note
For more information, see our previous workshop on using Anaconda on Hoffman2.
Containers, like Apptainer and Docker, are excellent for running Machine Learning applications on Hoffman2.
This example focuses on the “Fashion MNIST” dataset, a collection used frequently in machine learning for image recognition tasks.
Approach:
Dataset Overview:
Using Scikit-learn with Python:
Package Installation:
Python:
R:
Getting Started with Interactive Compute Node
Cloning and Navigating to the Code Repository
Let's look at the code, `minst.py`
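The script itself lives in the repository; the sketch below shows the kind of scikit-learn workflow such a script typically contains (using scikit-learn's small built-in digits dataset as a stand-in, since Fashion MNIST must be downloaded separately):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small image dataset (8x8 digit images; Fashion MNIST loads similarly).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a classifier on one CPU core (the default).
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on held-out images.
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```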
Running the Python Script:
The initial training took about 1 minute on 1 CPU core.
Speeding Up with Parallel Processing:
Note: Use the shared parallel environment, as scikit-learn doesn't support multi-node parallelism.
Code Adjustment for Parallelism:
`minst-par.py`
Run the code!
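The parallel version usually differs from the serial one by a single argument: most scikit-learn estimators accept `n_jobs`, which sets how many CPU cores to use. A hedged sketch of the change (the actual `minst-par.py` may differ in its details):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# n_jobs=10 trains the forest's trees across 10 CPU cores;
# n_jobs=-1 would use every core allocated to the job.
clf = RandomForestClassifier(n_estimators=100, n_jobs=10, random_state=42)
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.3f}")
```

Match `n_jobs` to the number of slots you requested from the scheduler, so the job does not oversubscribe its cores.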
Submitting Non-Interactive Jobs:
- For tasks that don't require interactive sessions, you can submit jobs to be processed in the background.
Command to Submit a Job:
- Use the qsub command to submit your job script to the queue:
Advantages:
Executing Code with a Single CPU:
Running Code with Parallel Processing (10 CPUs):
Submitting as a Batch Job:
🧬 DNA Sequence Classification with PyTorch
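Before a network can classify DNA, each sequence must be turned into numbers. A minimal, illustrative one-hot encoding in plain Python (a common preprocessing step; not taken from the workshop scripts):

```python
# Map each base to a 4-element one-hot vector over A, C, G, T.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of one-hot vectors."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

encoded = one_hot("ACGT")
print(encoded[0])  # → [1.0, 0.0, 0.0, 0.0]  (the vector for 'A')
```

A tensor of such vectors is what gets fed to the model; in the GPU version, the tensors and model are moved to the device (e.g. with PyTorch's `.to("cuda")`).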
Setting Up for GPU-Enabled PyTorch:
Code Location and Versions:
Find the code in the `dna-ex` directory: `dna-cpu.py` for the CPU version and `dna-gpu.py` for the GPU version.

The term Big Data refers to datasets and data science tasks that become too large and complex for traditional techniques.
Explore various frameworks, APIs, and libraries for handling Big Data
Dealing with extensive DATA presents unique challenges 😰:
Image source - DASK https://ml.dask.org/index.html
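Frameworks like Dask and Spark cope with oversized data by splitting it into partitions and processing them one at a time (or in parallel), so the full dataset never has to fit in memory. The idea can be sketched with a plain Python generator (a toy illustration, not Dask or Spark code):

```python
def chunked_mean(values, chunk_size=1000):
    """Compute a mean by streaming over fixed-size chunks,
    keeping only a running sum and count in memory."""
    total, count = 0.0, 0
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    total += sum(chunk)  # leftover partial chunk
    count += len(chunk)
    return total / count

# Works on a generator, so the full "dataset" is never materialized at once.
print(chunked_mean((x for x in range(1_000_000)), chunk_size=10_000))  # → 499999.5
```

Dask and Spark apply the same partition-at-a-time strategy, but also schedule the partitions across many cores and nodes.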
Using Spark’s MLlib for Music Data Analysis:
Dataset Characteristics:
Creating and Activating the Conda Environment:
Create and activate the conda environment `mypyspark`:
module load anaconda3
conda create -n mypyspark openjdk pyspark python \
    pyspark=3.3.0 py4j jupyterlab findspark \
    h5py pytables pandas matplotlib \
    -c conda-forge -c anaconda -y
conda activate mypyspark
pip install ipykernel
ipython kernel install --user --name=mypyspark
Environment Features:
Let’s practice basic PySpark functions with examples.
Open `MSD.ipynb` from the `MSD_ex` directory.
Downloading the Dataset:
We will use the h2jupynb script to start Jupyter on Hoffman2
You will run this on your LOCAL computer.
wget https://raw.githubusercontent.com/rdauria/jupyter-notebook/main/h2jupynb
chmod +x h2jupynb
# Replace 'joebruin' with your Hoffman2 user name
# You may need to enter your Hoffman2 password twice
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 1 -a intel-gold\\* \
    -x yes -d /SCRATCH/PATH/WS_MLonHPC/MSD_ex
Note
The `-d` option of `h2jupynb` must be given the full path to your `$SCRATCH/WS_MLonHPC` directory.
This will start a Jupyter session on Hoffman2 with ONE entire intel-gold compute node (36 cores)
More information on the h2jupynb can be found on the Hoffman2 website
AutoML, or Automated Machine Learning, is an innovative approach to automating the process of applying machine learning to real-world problems.
Key Benefits
Components of AutoML
HPC resources can be used to enhance AutoML, since it can be very computationally demanding.
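At its core, much of AutoML is a search over model configurations, which is embarrassingly parallel and therefore a natural fit for HPC. A toy sketch of a random hyperparameter search using only the standard library (the scoring function is a hypothetical stand-in for training a model):

```python
import random
from multiprocessing import Pool

def score_config(config):
    """Hypothetical stand-in for 'train a model with these
    hyperparameters and return its validation score'."""
    lr, depth = config
    # Pretend the best settings are lr ≈ 0.1 and depth ≈ 6.
    return -((lr - 0.1) ** 2) - ((depth - 6) ** 2)

if __name__ == "__main__":
    random.seed(0)
    # Sample 50 random configurations...
    configs = [(random.uniform(0.001, 1.0), random.randint(1, 12))
               for _ in range(50)]
    # ...and evaluate them in parallel, as an AutoML tool
    # would across the cores of an HPC allocation.
    with Pool(processes=4) as pool:
        scores = pool.map(score_config, configs)
    best = configs[scores.index(max(scores))]
    print(best)
```

Real AutoML tools add smarter search strategies (Bayesian optimization, genetic algorithms) and full pipeline construction on top of this pattern.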
H2O.ai AutoML: An open-source platform that automates the process of training and tuning a large selection of candidate models within H2O, a popular machine learning framework.
Auto-sklearn: An automated machine learning toolkit based on the scikit-learn library, focusing on automating the machine learning pipeline, including preprocessing, feature selection, and model selection.
TPOT (Tree-based Pipeline Optimization Tool): An open-source Python tool that uses genetic algorithms to optimize machine learning pipelines.
MLBox: A powerful Automated Machine Learning python library that provides robust preprocessing, feature selection, and model tuning capabilities.
Auto-Keras: An AutoML library built on the Keras platform.
Setting Up H2O.ai for Automated Machine Learning:
Create and activate a conda environment called `h2oai`:
🚀 Your environment is now set up with H2O.ai, ready for AutoML tasks.
Exploring AutoML with H2O: